# Multimodal Large Model

INFRL Qwen2.5 VL 72B Preview Ggufs Fully Quantized

An improved vision-language model based on Qwen2.5-VL-72B-Instruct that performs strongly on multiple visual reasoning benchmarks.

Text-to-Image English

Finetune VQA 1B

A visual question answering model fine-tuned from InternVL3-1B and Vintern-1B-v3_5. It supports Vietnamese and is suited to image content understanding and question-answering tasks.

Text-to-Image Other

Emova Qwen 2 5 3b

EMOVA is an end-to-end omni-modal large language model that handles visual, audio, and speech input and can generate text and speech responses with emotional control.

Multimodal Fusion

Transformers Supports Multiple Languages

Internvl3 1B Hf

InternVL3 is an advanced series of multimodal large language models, demonstrating exceptional multimodal perception and reasoning capabilities, supporting image, video, and text inputs.

Transformers Other

Internvl3 78B Pretrained

InternVL3-78B is a multimodal large language model developed by OpenGVLab. Compared to its predecessor, InternVL 2.5, it offers stronger multimodal perception and reasoning and extends to new domains such as tool use, GUI agents, industrial image analysis, and 3D visual perception.

Transformers Other

Qwen2.5 Omni 7B GPTQ 4bit

A 4-bit GPTQ quantized version of the Qwen2.5-Omni-7B model, supporting multilingual and multimodal tasks.

Multimodal Fusion

Safetensors Supports Multiple Languages

Internvl 2 5 HiCo R16

InternVideo2.5 is a video multimodal large language model (MLLM) enhanced by long and rich context (LRC) modeling, built upon InternVL2.5.

Transformers English

Internvideo2 5 Chat 8B

InternVideo2.5 is a video multimodal large language model (MLLM) enhanced by long and rich context (LRC) modeling, built upon InternVL2.5. It improves on existing MLLMs in perceiving fine-grained details and capturing long-term temporal structure.

Transformers English

Mplug Owl3 7B 241101

mPLUG-Owl3 is a multimodal large language model focused on understanding long image sequences. Its hyper attention mechanism improves processing speed and extends the supported sequence length.

Safetensors English

Llm Jp 3 Vila 14b

A large-scale vision-language model developed by Japan's National Institute of Informatics, supporting Japanese and English with strong image understanding and text generation capabilities.

Image-to-Text Japanese

Pixtral 12B Captioner Relaxed

An instruction-fine-tuned version of the Pixtral-12B-2409 multimodal large language model that generates more detailed descriptions of given images.

Transformers English

mPLUG-DocOwl2 is an OCR-free multimodal large language model for multi-page document understanding, efficiently encoding document content via a high-resolution document compressor.

Safetensors English

ChartMoE is a multimodal large language model based on InternLM-XComposer2, featuring a mixture of experts connector with advanced chart capabilities.

Kangaroo is a powerful multimodal large language model specifically designed for long video understanding, supporting bilingual dialogue (Chinese-English) and long video inputs.

Transformers Supports Multiple Languages

Internlm Xcomposer2 Vl 7b

InternLM-XComposer2 is a vision-language large model built on InternLM2, with strong image-text understanding and composition capabilities.

© 2025 AIbase